# Video Understanding
## Test With Sdfvd
A video understanding model fine-tuned from MCG-NJU/videomae-base, with average performance on its evaluation set (50% accuracy). A minimal inference sketch follows below.
- Task: Video Processing · Tags: Transformers
- Author: cocovani · Downloads: 16 · Likes: 0

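Several entries in this list are VideoMAE fine-tunes and load the same way. A minimal inference sketch, assuming the public MCG-NJU/videomae-base-finetuned-kinetics checkpoint as a stand-in and a random 16-frame clip in place of real video frames:

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Checkpoint id is a stand-in: substitute the fine-tuned repo you want to test.
ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# VideoMAE consumes 16-frame RGB clips; random data stands in for frames
# sampled from a real video.
video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
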
## Internvl3 1B Hf
InternVL3 is an advanced series of multimodal large language models, demonstrating exceptional multimodal perception and reasoning capabilities and supporting image, video, and text inputs. A minimal usage sketch follows below.
- License: Other · Task: Image-to-Text · Tags: Transformers, Other
- Author: OpenGVLab · Downloads: 1,844 · Likes: 2

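For the Transformers-native InternVL3 packaging, recent transformers releases expose an image-text-to-text pipeline. A minimal sketch, assuming the OpenGVLab/InternVL3-1B-hf repo id and a sample image URL from the Hugging Face documentation assets:

```python
from transformers import pipeline

# Checkpoint id assumed from the card above; requires a recent transformers
# release that ships the image-text-to-text pipeline.
pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")

messages = [{
    "role": "user",
    "content": [
        {"type": "image",
         "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])
```
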
## Datatrain Videomae Base Finetuned Lr1e 07 Poly3
A video understanding model fine-tuned from MCG-NJU/videomae-base, trained on an unknown dataset with an accuracy of 11.1%.
- Task: Video Processing · Tags: Transformers
- Author: EloiseInacio · Downloads: 13 · Likes: 0

## Videomae Base Finetuned 1e 08 Bs4 Ep2
A video understanding model fine-tuned from MCG-NJU/videomae-base, trained on an unknown dataset. A sketch of the training recipe implied by the name follows below.
- Task: Video Processing · Tags: Transformers
- Author: EloiseInacio · Downloads: 14 · Likes: 0

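The two EloiseInacio fine-tune names above appear to encode their training recipes (learning rate 1e-07 with a degree-3 polynomial schedule; learning rate 1e-08, batch size 4, 2 epochs). Below is a sketch of how such a run is typically set up with the Trainer API; the dummy dataset and binary label set are placeholders, since neither card names its data:

```python
import torch
from torch.utils.data import Dataset
from transformers import (Trainer, TrainingArguments,
                          VideoMAEForVideoClassification)

class RandomClips(Dataset):
    """Placeholder dataset: random 16-frame clips with binary labels."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return {"pixel_values": torch.randn(16, 3, 224, 224),
                "labels": torch.tensor(i % 2)}

# num_labels=2 is an assumption; neither card names its label set.
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=2)

# Hyperparameters mirror the "1e 08 Bs4 Ep2" naming of the card above;
# the sibling card would instead use learning_rate=1e-7 with
# lr_scheduler_type="polynomial".
args = TrainingArguments(
    output_dir="videomae-base-finetuned-1e-08-bs4-ep2",
    learning_rate=1e-8,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=RandomClips()).train()
```
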
## Qwen2.5 Omni 7B GPTQ 4bit
A 4-bit GPTQ-quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.
- License: MIT · Task: Multimodal Fusion · Tags: Safetensors, Multilingual
- Author: FunAGI · Downloads: 3,957 · Likes: 51

## Slowfast Video Mllm Qwen2 7b Convnext 576 Frame96 S1t6
A video multimodal LLM that adopts a slow-fast architecture to balance temporal resolution against spatial detail, sidestepping the sequence-length limits of conventional large language models; per its name, this variant processes 96-frame clips.
- Task: Video-to-Text · Tags: Transformers
- Author: shi-labs · Downloads: 81 · Likes: 0

## Videollama2.1 7B AV CoT
VideoLLaMA2.1-7B-AV is a multimodal large language model focused on audio-visual question answering, processing video and audio inputs jointly to produce high-quality answers and descriptions.
- License: Apache-2.0 · Task: Video-to-Text · Tags: Transformers, English
- Author: lym0302 · Downloads: 34 · Likes: 0

## Videomind 2B
VideoMind is a multimodal agent framework that enhances video reasoning by simulating human thought processes such as task decomposition, moment localization and verification, and answer synthesis.
- License: BSD-3-Clause · Task: Video-to-Text
- Author: yeliudev · Downloads: 207 · Likes: 1

## Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4
A video multimodal large language model using a slow-fast architecture to balance temporal resolution and spatial detail, supporting 64-frame video understanding.
- Task: Video-to-Text · Tags: Transformers
- Author: shi-labs · Downloads: 184 · Likes: 0

## Tinyllava Video Qwen2.5 3B Group 16 512
TinyLLaVA-Video is a video understanding model based on Qwen2.5-3B and siglip-so400m-patch14-384, using a grouped resampler to process video frames.
- License: Apache-2.0 · Task: Video-to-Text
- Author: Zhang199 · Downloads: 76 · Likes: 0

## Internvl 2 5 HiCo R16
InternVideo2.5 is a video multimodal large language model (MLLM) enhanced by long and rich context (LRC) modeling, built upon InternVL2.5.
- License: Apache-2.0 · Task: Text-to-Video · Tags: Transformers, English
- Author: FriendliAI · Downloads: 129 · Likes: 1

## Llava NeXT Video 7B Hf
LLaVA-NeXT-Video-7B-hf is a video-based multimodal model that processes video and text inputs to generate text outputs. A minimal generation sketch follows below.
- Task: Video-to-Text · Tags: English
- Author: FriendliAI · Downloads: 30 · Likes: 0

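The "-hf" suffix marks weights repackaged for native Transformers support. A minimal generation sketch, assuming the widely mirrored llava-hf/LLaVA-NeXT-Video-7B-hf repo id and random frames in place of a sampled clip:

```python
import numpy as np
import torch
from transformers import (LlavaNextVideoForConditionalGeneration,
                          LlavaNextVideoProcessor)

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed llava-hf packaging
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

conversation = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is happening in this clip?"},
        {"type": "video"},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Random frames stand in for a clip sampled from a real video.
clip = np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8)

inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
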
## Videomae Base Finetuned Signlanguage Last 3
A video understanding model fine-tuned from MCG-NJU/videomae-base, specialized for sign language recognition.
- Task: Video Processing · Tags: Transformers
- Author: ihsanahakiim · Downloads: 21 · Likes: 1

## Internvl2 5 4B AWQ
InternVL2_5-4B-AWQ is an AWQ-quantized version of InternVL2_5-4B produced with AutoAWQ, supporting multilingual and multimodal tasks.
- License: MIT · Task: Image-to-Text · Tags: Transformers, Other
- Author: rootonchair · Downloads: 29 · Likes: 2

## Magma 8B
Magma is a foundation model for multimodal AI agents: it processes image and text inputs to generate text outputs, with complex interaction abilities in both virtual and real-world environments.
- License: MIT · Task: Image-to-Text · Tags: Transformers
- Author: microsoft · Downloads: 4,526 · Likes: 363

## Smolvlm2 500M Video Instruct
A lightweight multimodal model designed for analyzing video content, processing video, image, and text inputs to generate text outputs. A minimal usage sketch follows below.
- License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, English
- Author: HuggingFaceTB · Downloads: 17.89k · Likes: 56

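SmolVLM2 checkpoints document a chat-template path that accepts video files directly. A sketch following that pattern; the repo id and the local clip.mp4 path are assumptions, and video inputs require a recent transformers release:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# "clip.mp4" is a placeholder path to a local video file.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "clip.mp4"},
        {"type": "text", "text": "Summarize this video."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
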
## Fluxi AI Small Vision
Fluxi AI is a multimodal intelligent assistant based on Qwen2-VL-7B-Instruct that handles text, images, and video, with particular optimization for Portuguese.
- License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, Other
- Author: JJhooww · Downloads: 25 · Likes: 2

## Internlm Xcomposer2d5 7b Chat
InternLM-XComposer2.5-Chat is a dialogue model trained on top of InternLM-XComposer2.5-7B, showing significant improvements in multimodal instruction following and open-ended dialogue.
- License: Other · Task: Text-to-Image · Tags: PyTorch
- Author: internlm · Downloads: 87 · Likes: 5

## Eagle2 2B
Eagle2 is a family of high-performance vision-language models from NVIDIA that focuses on improving open-source VLMs through data strategy and training recipes. Eagle2-2B is the lightweight member of the series, delivering strong efficiency and speed while maintaining robust performance.
- Task: Text-to-Image · Tags: Transformers, Other
- Author: nvidia · Downloads: 667 · Likes: 21

## Eagle2 9B
Eagle2-9B is the latest vision-language model (VLM) released by NVIDIA, striking a strong balance between performance and inference speed. It combines the Qwen2.5-7B-Instruct language model with a SigLIP+ConvNeXt vision stack and supports multilingual and multimodal tasks.
- Task: Image-to-Text · Tags: Transformers, Other
- Author: nvidia · Downloads: 944 · Likes: 52

## Llava Mini Llama 3.1 8b
LLaVA-Mini is an efficient multimodal large model that markedly improves image and video understanding efficiency by representing each image with a single vision token.
- License: GPL-3.0 · Task: Image-to-Text
- Author: ICTNLP · Downloads: 12.45k · Likes: 51

## Xgen Mm Vid Phi3 Mini R V1.5 128tokens 8frames
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model equipped with an explicit temporal encoder, specifically designed for video content understanding.
- Task: Video-to-Text · Tags: Safetensors, English
- Author: Salesforce · Downloads: 398 · Likes: 11

## Mplug Owl3 7B 240728
mPLUG-Owl3 is a cutting-edge multimodal large language model designed to tackle the challenges of long image-sequence understanding, supporting single-image, multi-image, and video tasks.
- License: Apache-2.0 · Task: Text-to-Image · Tags: Safetensors, English
- Author: mPLUG · Downloads: 4,823 · Likes: 39

## Minicpm V 2 6 Int4
MiniCPM-V 2.6 is a multimodal vision-language model supporting image-to-text conversion with multilingual capabilities; this repository provides the int4-quantized build.
- Task: Image-to-Text · Tags: Transformers, Other
- Author: openbmb · Downloads: 122.58k · Likes: 79

## Llava NeXT Video 7B DPO
LLaVA-Next-Video is an open-source multimodal dialogue model, built by fine-tuning a large language model on multimodal instruction-following data and supporting mixed video-and-text interaction; this is the DPO-tuned variant.
- Task: Text-to-Video · Tags: Transformers
- Author: lmms-lab · Downloads: 8,049 · Likes: 27

## Llava NeXT Video 7B
LLaVA-Next-Video is an open-source multimodal chatbot, fine-tuned from a large language model and supporting mixed video-and-text interaction.
- Task: Text-to-Video · Tags: Transformers
- Author: lmms-lab · Downloads: 1,146 · Likes: 46

## Model Timesformer Subset 02
A video understanding model based on the TimeSformer architecture, fine-tuned on an unknown dataset with an accuracy of 88.52%. A minimal loading sketch follows below.
- Task: Video Processing · Tags: Transformers
- Author: namnh2002 · Downloads: 15 · Likes: 0

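TimeSformer classifiers load through the standard Transformers video-classification API. A minimal sketch; since the card does not name its base checkpoint, the common facebook/timesformer-base-finetuned-k400 stands in:

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

# The card does not name its base checkpoint; this assumes the common
# Kinetics-400 fine-tuned TimeSformer as a representative starting point.
ckpt = "facebook/timesformer-base-finetuned-k400"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TimesformerForVideoClassification.from_pretrained(ckpt)

# This checkpoint's config samples 8 frames per clip.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
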
## MMICL Instructblip T5 Xxl
MMICL is a multimodal vision-language model built on BLIP-2/InstructBLIP, able to analyze and understand multiple images while following instructions.
- License: MIT · Task: Image-to-Text · Tags: Transformers, English
- Author: BleachNick · Downloads: 156 · Likes: 11

## Videomae Base Ipm All Videos
A vision model fine-tuned from the VideoMAE base model on an unknown video dataset, used primarily for video understanding tasks; it achieves 85.59% accuracy on its evaluation set.
- Task: Video Processing · Tags: Transformers
- Author: rickysk · Downloads: 30 · Likes: 0

## Videomae Base Finetuned
A video understanding model fine-tuned from MCG-NJU/videomae-base on an unknown dataset, achieving an F1 score of 0.7147.
- Task: Video Processing · Tags: Transformers
- Author: sheraz179 · Downloads: 15 · Likes: 0

## Videomae Base Finetuned
A video understanding model fine-tuned from the VideoMAE base model on an unknown dataset, achieving 86.41% accuracy on its evaluation set.
- Task: Video Processing · Tags: Transformers
- Author: LouisDT · Downloads: 15 · Likes: 0

## Vivit B 16x2
ViViT extends the Vision Transformer (ViT) to video processing and is used primarily for downstream tasks such as video classification. A minimal classification sketch follows below.
- License: MIT · Task: Video Processing · Tags: Transformers
- Author: google · Downloads: 989 · Likes: 11

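ViViT follows the same video-classification pattern as the other classifiers above. A minimal sketch, assuming the Kinetics-400 fine-tuned variant google/vivit-b-16x2-kinetics400, since the backbone alone carries no classification head:

```python
import numpy as np
import torch
from transformers import VivitForVideoClassification, VivitImageProcessor

# The listed repo is a pretrained backbone; classification assumes the
# Kinetics-400 fine-tuned variant.
ckpt = "google/vivit-b-16x2-kinetics400"
processor = VivitImageProcessor.from_pretrained(ckpt)
model = VivitForVideoClassification.from_pretrained(ckpt)

# ViViT's tubelet embedding consumes 32-frame clips at 224x224.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```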